Humans are sophisticated at reading interlocutors' emotions from multimodal signals, such as speech contents, voice tones and facial expressions. However, machines might struggle to understand various emotions due to the difficulty of effectively decoding emotions from the complex interactions between multimodal signals. In this paper, we propose a multimodal emotion analysis framework, InterMulti, to capture complex multimodal interactions from different views and identify emotions from multimodal signals. Our proposed framework decomposes signals of different modalities into three kinds of multimodal interaction representations, including a modality-full interaction representation, a modality-shared interaction representation, and three modality-specific interaction representations. Additionally, to balance the contribution of different modalities and learn a more informative latent interaction representation, we developed a novel Text-dominated Hierarchical High-order Fusion(THHF) module. THHF module reasonably integrates the above three kinds of representations into a comprehensive multimodal interaction representation. Extensive experimental results on widely used datasets, (i.e.) MOSEI, MOSI and IEMOCAP, demonstrate that our method outperforms the state-of-the-art.
translated by 谷歌翻译
Humans are skilled in reading the interlocutor's emotion from multimodal signals, including spoken words, simultaneous speech, and facial expressions. It is still a challenge to effectively decode emotions from the complex interactions of multimodal signals. In this paper, we design three kinds of multimodal latent representations to refine the emotion analysis process and capture complex multimodal interactions from different views, including a intact three-modal integrating representation, a modality-shared representation, and three modality-individual representations. Then, a modality-semantic hierarchical fusion is proposed to reasonably incorporate these representations into a comprehensive interaction representation. The experimental results demonstrate that our EffMulti outperforms the state-of-the-art methods. The compelling performance benefits from its well-designed framework with ease of implementation, lower computing complexity, and less trainable parameters.
translated by 谷歌翻译
Scene text images have different shapes and are subjected to various distortions, e.g. perspective distortions. To handle these challenges, the state-of-the-art methods rely on a rectification network, which is connected to the text recognition network. They form a linear pipeline which uses text rectification on all input images, even for images that can be recognized without it. Undoubtedly, the rectification network improves the overall text recognition performance. However, in some cases, the rectification network generates unnecessary distortions on images, resulting in incorrect predictions in images that would have otherwise been correct without it. In order to alleviate the unnecessary distortions, the portmanteauing of features is proposed. The portmanteau feature, inspired by the portmanteau word, is a feature containing information from both the original text image and the rectified image. To generate the portmanteau feature, a non-linear input pipeline with a block matrix initialization is presented. In this work, the transformer is chosen as the recognition network due to its utilization of attention and inherent parallelism, which can effectively handle the portmanteau feature. The proposed method is examined on 6 benchmarks and compared with 13 state-of-the-art methods. The experimental results show that the proposed method outperforms the state-of-the-art methods on various of the benchmarks.
translated by 谷歌翻译
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose the participants to design an efficient quantized image super-resolution solution that can demonstrate a real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do a high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
translated by 谷歌翻译
与大脑变化相关的阿尔茨海默氏病(AD)和轻度认知障碍(MCI)的评估仍然是一项艰巨的任务。最近的研究表明,多模式成像技术的组合可以更好地反映病理特征,并有助于更准确地诊断AD和MCI。在本文中,我们提出了一种新型的基于张量的多模式特征选择和回归方法,用于诊断和生物标志物对正常对照组的AD和MCI鉴定。具体而言,我们利用张量结构来利用多模式数据中固有的高级相关信息,并研究多线性回归模型中的张量级稀疏性。我们使用三种成像方式(VBM- MRI,FDG-PET和AV45-PET)具有疾病严重程度和认知评分的临床参数来分析ADNI数据的方法的实际优势。实验结果表明,我们提出的方法与疾病诊断的最新方法的优越性能以及疾病特异性区域和与模态相关的差异的鉴定。这项工作的代码可在https://github.com/junfish/bios22上公开获得。
translated by 谷歌翻译
广义文本表示是许多自然语言理解任务的基础。要充分利用不同的语料库,不可避免地需要了解它们之间的相关性。但是,许多方法忽略了相关性,并直接用于所有任务的单通道模型(粗糙的范式),这缺乏足够的理性和解释。此外,一些现有的作品通过针迹技能块(一个精细的范式)学习下游任务,这可能会导致其冗余和噪音,从而导致非理性。在这项工作中,我们首先通过三种不同的观点分析任务相关性,即数据属性,手动设计和基于模型的相关性,基于相似的任务被分组在一起。然后,我们提出了一个用粗到细范式的层次结构框架,其最底层共享了所有任务,中层级别分为不同的组,以及分配给每个任务的顶级级别。这使我们的模型可以从所有任务中学习基本的语言属性,提高相关任务的性能,并减少不相关任务的负面影响。我们在五个自然语言理解任务的13个基准数据集上进行的实验证明了我们方法的优势。
translated by 谷歌翻译
在本文中,我们介绍了DA $^2 $,这是第一个大型双臂灵敏性吸引数据集,用于生成最佳的双人握把对,用于任意大型对象。该数据集包含大约900万的平行jaw grasps,由6000多个对象生成,每个对象都有各种抓紧敏度度量。此外,我们提出了一个端到端的双臂掌握评估模型,该模型在该数据集的渲染场景上训练。我们利用评估模型作为基准,通过在线分析和真实的机器人实验来显示这一新颖和非平凡数据集的价值。所有数据和相关的代码将在https://sites.google.com/view/da2dataset上开源。
translated by 谷歌翻译
在线操作检测是一旦在流视频中进行的操作,就可以预测该动作。一个主要的挑战是,该模型无法访问未来,并且必须仅依靠历史,即到目前为止观察到的框架来做出预测。因此,重要的是要强调历史的一部分,这些部分对当前框架的预测更有意义。我们提出了带有背景抑制的封闭历史单元的Gatehub,其中包括一种新颖的位置引导的封闭式跨注意机制,以增强或抑制历史的一部分,因为它们在当前框架预测方面的信息程度。 GateHub进一步建议未来的历史记录(FAH),通过使用后来观察到的框架,使历史特征更具信息性。在一个统一的框架中,GateHub集成了变压器的远程时间建模的能力以及经常性模型选择性编码相关信息的能力。 GateHub还引入了一个背景抑制目标,以进一步减轻与动作框架非常相似的误报背景框架。对三个基准数据集(Thumos,TVSeries和HDD)进行了广泛的验证,这表明GateHub显着胜过所有现有方法,并且比现有最佳工作更有效。此外,与所有需要RGB和光流信息进行预测的现有方法相比,GateHub的无流版本能够以2.8倍的帧速率获得更高或密切的精度。
translated by 谷歌翻译
行动预测旨在通过部分观察视频推断即将举行的人类行动,这是由于早期观察结果有限的信息有限。现有方法主要采用重建策略来处理此任务,期望从部分观察到完整视频来学习单个映射函数,以便于预测过程。在这项研究中,我们提出了来自两个新方面的部分视频查询生成“完整视频”功能调节的对抗性记忆网络(AMEMNet)。首先,键值结构化存储器发生器旨在将不同的部分视频存储为键存储器,并在具有门控机制和查询关注的值存储器中动态地写入完整视频。其次,我们开发了一个类感知判别者,以指导内存发生器在对抗训练时不仅提供现实,而且还提供鉴别的完整视频特征。通过RGB和光学流量的晚期融合给出了AMEMNET的最终预测结果。提供两个基准视频数据集,UCF-101和HMDB51的广泛实验结果,以证明所提出的AMEMNET模型在最先进的方法的有效性。
translated by 谷歌翻译
尽管在许多自然语言处理(NLP)任务中进行了预先训练的语言模型(LMS),但它们需要过多标记的数据来进行微调以实现令人满意的性能。为了提高标签效率,研究人员采取了活跃的学习(AL),而大多数事先工作则忽略未标记数据的潜力。要释放未标记数据的强大功能以获得更好的标签效率和模型性能,我们开发ATM,一个新的框架,它利用自我训练来利用未标记的数据,并且对于特定的AL算法不可知,用作改善现有的插件模块Al方法。具体地,具有高不确定性的未标记数据暴露于Oracle以进行注释,而具有低不确定性的人则可用于自培训。为了缓解自我训练中的标签噪声传播问题,我们设计一个简单且有效的基于动量的内存库,可以动态地从所有轮次汇总模型预测。通过广泛的实验,我们证明了ATM优于最强大的积极学习和自我训练基线,平均将标签效率提高51.9%。
translated by 谷歌翻译